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Abstract 

Background: Chromatin immunoprecipitation followed by deep sequencing (ChlP-seq) is the most widely used 
method for characterizing the epigenetic states of chromatin on a genomic scale. With the recent availability of large 
genome-wide data sets, often comprising several epigenetic marks, novel approaches are required to explore 
functionally relevant interactions between histone modifications. Computational discovery of "chromatin states" 
defined by such combinatorial interactions enabled descriptive annotations of genomes, but more quantitative 
approaches are needed to progress towards predictive models. 

Results: We propose non-negative matrix factorization (NMF) as a new unsupervised method to discover 
combinatorial patterns of epigenetic marks that frequently co-occur in subsets of genomic regions. We show that this 
small set of combinatorial "codes" can be effectively displayed and interpreted. NMF codes enable dimensionality 
reduction and have desirable statistical properties for regression and classification tasks. We demonstrate the utility of 
codes in the quantitative prediction of Pol2-binding and the discrimination between Pol2-bound promoters and 
enhancers. Finally, we show that specific codes can be linked to molecular pathways and targets of pluripotency 
genes during differentiation. 

Conclusions: We have introduced and evaluated a new computational approach to represent combinatorial 
patterns of epigenetic marks as quantitative variables suitable for predictive modeling and supervised machine 
learning. To foster widespread adoption of this method we make it available as an open-source software-package - 
epicode at https://github.com/mcieslik-mctp/epicode. 



Background 

Biochemical and structural properties of chromatin are 
implicated in the function and maintenance of genomes 
(e.g. [1]). Chromatin immunoprecipitation followed by 
deep sequencing (ChlP-seq) is becoming the standard 
method for the genome-wide mapping of histone modifi- 
cations and transcription factor (TF) binding sites [2]. 

The analysis and interpretation of ChlP-seq data sets 
is a difficult task [3]. Most of the existing analysis tools 
are focused on the delineation of enriched sites from a 
single sample with optional "input control" [4]. For his- 
tone modifications this task becomes more challenging 
as their enrichments are often weaker and less local- 
ized. A number of groups have extended the peak-calling 
approach to identify broad domains [5,6] or analytically 
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represent ChlP-signals beyond read-counts [7]. In order 
to link epigenetic marks to biological functions and pro- 
cesses, peak calling has also been adapted to paired exper- 
imental designs [8]. Individually, each epigenetic mark 
provides some data towards understanding the structure 
and biochemistry of the underlying genome. However, it 
has been argued that the cooperative action of multiple 
histone modifications, variants, and TFs is functionally 
most informative [9,10]. Unfortunately, none of the stan- 
dard peak-based method deals with multiple marks and 
the reconciliation of several sets of peaks is an added 
challenge [11-13]. 

An alternative, and orthogonal, approach is to inte- 
grate individual histone modification maps to discover 
latent relationships between epigenetic marks. Broadly, 
these approaches fall into two categories: genome- wide 
segmentation and locus-based clustering. For example, 
ChromHMM and Segway [14,15] partition the genome 
into epigenetically-similar regions and have been able 
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to reliably associate chromatin profiles with transcrip- 
tion start sites and putative enhancer regions [16]. 
Similarly, clustering approaches, such as ChromaSig, 
attempt to identify loci with globally congruent "chro- 
matin signatures". Although the two types of methods 
differ greatly in the statistical modeling of data, they 
make the general assumption that a small set of "chro- 
matin states" is sufficient to annotate the genome [17]. 
Experimental results suggests that these models are too 
restricted to capture the genome-wide variability of chro- 
matin patterns [16]. The number of global "chromatin 
states" has been estimated to be in the several hundreds 
even when only a small set of marks is used to define each 
pattern [18]. 

Both clustering and segmentation results in the hard 
assignment of a single "chromatin signature" to each 
locus. This allows for certain types of functional enrich- 
ment analyses [19], but is not, in general, conductive to 
quantitatively link "chromatin state" to genome biology. 
Regression and other supervised machine learning tech- 
nique are needed to move from descriptive annotations to 
quantitative and predictive models [20]. In most of these 
approaches, levels of epigenetic signals are linked to a bio- 
logically important readout {e.g. transcript level [21,22] or 
polymerase occupancy [20]). Unfortunately, histone mod- 
ifications tend to be highly correlated, which makes it 
difficult to asses the relative importance of the variables 
(marks) [23] . Since these problems are further exacerbated 
during stepwise regression, it is difficult to explain how, in 
terms of direction and strength, combinatorial interactions 
between marks are linked to the biological readout [24] . 

Here, we describe a novel method based on non- 
negative matrix factorization (NMF) to discover com- 
binatorial patterns of epigenetic marks from integrated 
epigenetic data sets. Locus-specific weights of these mark 
co-occurrence patterns are used as quantitative variables, 
suitable for regression and supervised machine learn- 
ing. We are able to demonstrate that basis patterns are 
quantitative predictors of biochemical activity, discrimi- 
nate between classes of genomic regions, and are asso- 
ciated with molecular pathways. Hence we propose to 
call these patterns bona fide epigenetic "codes". In the 
remaining sections we describe the basic algorithm and its 
extensions (Formulation), investigate important statistical 
properties of basis patterns (Properties), and show their 
utility in regression, classification, and gene set analysis 
(Case Studies). A reference implementation of the method 
is available at https://github.com/mcieslik-mctp/epicode 
and in (Additional file 1). 

Results 

Formulation 

The total number of distinct "chromatin states" in 
the genome is likely inestimable, but clearly specific 



combinations of a small number of marks are associated 
with distinct functions or region classes [18,25]. Rather 
than trying to delineate global "chromatin states" we 
attempt to identify patterns of marks that frequently co- 
occur in subsets of genomic regions. We anticipate marks 
within a combinatorial pattern to be "written" or "erased" 
by the same chromatin remodeling complex or during the 
same reprogramming event, which results in their high 
correlation. Along the lines of the original "histone code" 
hypothesis [9] we expect these patterns to either, encode 
biochemical signals that are recognized by multivalent 
epigenetic "readers" [26], or to represent coordinated epi- 
genetic regulation [27,28] . We introduce a method which 
represents the full set of histone modifications or variants 
occurring at a selected annotation class {i.e.y promoters 
or enhancers) across a genome in terms of a small set of 
co-occurring "basis" patterns. We will refer to these basis 
patterns as "codes". In contrast to previous approaches 
[16,17,25] we attempt to represent the unique "chromatin 
signature" at each locus as a weighted superposition of 
multiple basis patterns {i.e. each locus will be a linear 
combination of several codes with non-zero weights). We 
formulate the task of epigenetic code discovery in the 
framework of non-negative matrix factorization (NMF) 
[29,30]. This method transforms an input matrix V into 
two factor matrices H and W: 

V^WH 

In the context of epigenomics V is a matrix of the 
observed "chromatin signatures". Each row of this matrix 
is an arbitrary user defined locus e.g. a region of 2 kbp 
flanking a transcription start site (TSS). Each column 
quantifies the level of a histone modification and is a 
function of the number of reads mapping to at least one 
base pair within this locus. // is a small matrix of sparse 
basis patterns, technically called basis vectors, which we 
refer to as codes, and is a matrix of weights to recon- 
struct V using the codes in H (Figure IB). Within a single 
basis pattern highly correlated input variables have pos- 
itive values. We observed that for epigenetic marks the 
NMF algorithm yields a sparse matrix H. The result- 
ing basis vectors in H are dissimilar and interpretable 
e.g. (Figure 2B). Unlike other matrix factorization meth- 
ods, NMF is suitable for this particular task because it 
constrains both H and W to be non-negative. Given a fac- 
torization V ^ WH we can assign code labels to genes by 
finding, for each gene, the code with the highest weight 
in Wy which is analogous to "hard" cluster assignment in 
K-means [29,31]. 

The basic NMF procedure randomly initializes matri- 
ces W and H and minimizes the reconstruction error 
V — WH - difference between the actual and model out- 
put values of the epigenetic factor levels - by updating 
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Figure 1 NMF-based algorithms for epigenomic data. (A) Schematic representation of the transformation of ChlP-seq data, (top) transformation 
of read counts into elements of the V matrix in the basic "absolute" mode. Each mark - locus combination is a single element in matrix V. Reads at 
each locus are summed. Columns of V are additionally scaled, (bottom) transformation of read counts from paired samples in "differential" mode. 
The differential signal is obtained by subtracting sample A coverage from sample B coverage after correction for sequencing-depth. Positive and 
negative area under curve is summed (integrated) into "gain" and "loss" scores. (B) NMF factorization in the "absolute" algorithm. V matrix is the 
same as the top of sub-panel (A). (C) NMF factorization in the "discriminatory" algorithm, contains two classes of loci V] and Vj, which are used to 
derive two independent basis pattern matrices H] and H2. All codes are concatenated and used to derive a single matrix 1/1/] 2. (D) NMF factorization 
in the "differential" algorithm. The V matrix corresponds to the bottom of sub-panel (A) and contains columns for the "gain" and "loss" of each mark. 



W and H using a projected gradient algorithm [32]. This 
algorithm finds only local optima that depend on the 
starting conditions, analogously to the common /C-means 
algorithm. For the initialization of the NMF algorithm 
we propose to use the deterministic non-negative dou- 
ble singular value decomposition (NNDSVD) technique 
[33]. The NMF algorithm depends on a single param- 
eter c - the rank of the factor matrices, which is the 
expected number of basis patterns. As with most unsu- 
pervised algorithms the choice of c is not straightforward. 
A large c results in sparse codes and few combinatorial 
interactions. Sparsity of the output is a prominent feature 
of NMF, but is further enhanced by additional constraints 
[34] . In our implementation the constraints are applied to 
the matrix H and thus favor combinatorial patterns of only 
few histone marks. To illustrate the sensitivity of NMF to 
initialization and the relative performance of NNDSVD 
for epigenomic data we compared the default factoriza- 
tion with a random initialization approach (Figure 2C). 
We found that this approach has the smallest reconstruc- 
tion error (the objective function of NMF) and largest 
sparsity. Further, randomly initialized solutions tend to 
have a smaller reconstruction error if their H matrix is 
more similar to the NNDSVD solution (Additional file 2: 
Figure SI). Together, these results show that NMF out- 
put is sensitive to initialization. However, the NNDSVD 



approach yields a solution that outperforms even a large 
number of random runs. 

We develop three complementary approaches which 
apply the NMF algorithm on epigenomic profile data for 
distinct tasks of prediction, classification and associa- 
tion: (1) absolute, (2) discriminatory and (3) differential 
(Figure 1). As shown in (Figure lA-B), the absolute algo- 
rithm performs NMF on the quantified levels of epigenetic 
marks at one annotation class {e.g., promoters). The dis- 
criminatory algorithm performs NMF on quantified levels 
of the same set of epigenetic marks at two classes of loci 
{e.g., promoters and enhancers) (Figure IC). As depicted in 
Figure ID, the differential algorithm performs NMF on nor- 
malized differential epigenetic levels - gains and losses - 
between two cell lines or cell states {e.g. stem cells vs 
differentiated cells). 

To construct V from genome-wide maps of multiple 
histone modifications, we individually quantify and scale 
"absolute" signals of epigenetic marks at each queried 
locus (Figure lA top). Each row of the input matrix V rep- 
resents scaled levels of epigenetic marks within a single 
locus. Some form of column normalization, or scaling, is 
usually necessary to account for the differences in magni- 
tudes and dynamic ranges of histone modifications, and 
to reveal the patterns of interest [35]. By default we use a 
sigmoid function to normalize all signals to 0 to 1 range 
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Figure 2 Properties of NMF on epigenomic data. (A) Average Sparsity of factor matrices 1/1/ and /-/ as a function of c calculated using Hoyer's 
formula (see Results). (B) Approximate nesting of "basis patterns" for factorizations using different factorization ranks c. Nesting occurs if a lower rank 
basis vector is approximately derived by adding two higher rank basis vectors. (C) NMF reconstruction error and sparsity on 1 000 random 
initializations of matrices H and 1/1/. NNDSVD - deterministic non-negative double singular value decomposition, random - uniformly (0 to 1 range) 
random non-negative matrices. The 1 000 NNDSVDar randomizations find approximately the same factorization as NNDSVD and are not shown. 



as this has been shown to accelerate and improve NMF 
[36,37]. 

Different classes of genomic regions, such as promot- 
ers and enhancers, show discriminatory epigenetic pat- 
terns [25]. Regulatory mechanism operating at distinct 
classes often have a unique epigenetic component, such 
as the activity of a specific chromatin remodeling complex 
{e.g. [38]). Thus, it is reasonable to assume that specific 
or enriched combinatorial patterns could discriminate 



between classes of sites. To identify such specific codes 
we propose the "discriminatory" algorithm (Figure IC). In 
this mode we first apply the "absolute" algorithm at each 
set of k genomic regions separately and next reconstruct 
a single weight matrix. Specifically, we partition input 
matrix V into k sub-matrices 1//, each of these matrices 
is independently factored Vi = WiHi, next we concate- 
nate the k Hi matrices into a single matrix H. Finally the 
matrix W is obtained using non-negative least squares 
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from matrices V and H, Intuitively, we first discover opti- 
mal codes for each class of genomic regions and next allow 
all codes to be used to describe the "chromatin signa- 
ture" at each locus regardless of its class. If the epigenetic 
patterns at different classes of sites have the same latent 
structure, the discovered class-specific codes will be very 
similar or interchangeable. In both cases codes discovered 
for one class of sites will be useful to encode the epigenetic 
features of other classes of loci. On the other hand, if the 
latent epigenetic structure of the different region types is 
dissimilar, some of the discovered codes will be discrim- 
inatory and not useful to encode epigenetic features of 
other classes. 

Histone modification levels are dynamic and are due to 
net changes in the activity of modifying enzymes called 
"writers" and "erasers" [39]. Relative to a second sample 
a locus might show "gain", "loss", or, if it is sufficiently 
large, both "gain" and "loss" of a histone mark. Although 
chromatin remodelling complexes often have multiple 
catalytic activities and substrate cross-reactivities [40], 
simultaneous changes to multiple marks at a subset of 
loci might suggest a shared regulatory mechanisms or 
function [41]. Therefore, we define basis patterns in the 
dynamic context as coordinated changes to histone mod- 
ification levels. Analogously to the "absolute" and "dis- 
criminatory" cases, in "differential" mode (Figure ID, 
Figure lA bottom), mark levels are quantified within each 
query locus. However, because paired samples are typ- 
ically sequenced to different depths, the mapped read 
counts are normalized using the DESeq algorithm [42]. 
Within each locus absolute signals are transformed into 
differential "gain-loss" scores (Figure lA bottom). This 
approach results in twice the number of columns in 
V -two for each epigenetic mark. Histone modifica- 
tion levels are spatially auto-correlated. In "absolute" and 
"discriminatory" modes we rely on this property to cal- 
culate average enrichment levels within a possibly large 
locus. Much less is known about the auto-correlation of 
differential (subtracted) levels. Therefore we divide each 
locus into adjustable windows (default 100 bp). For each 
window paired ChIP signals are subtracted resulting in a 
net "gain" or "loss" of a histone modification. We obtain 
the final per-locus "gain" score by summing the windows 
with a net "gain", and the "loss" score by summing win- 
dows with a net "loss". If the differential signal is strongly 
auto-correlated most windows within a locus will show 
"gains" or "losses" and the whole locus will show only 
"gain" or "loss". A simple example shows that this is not 
always the case. If a peak is broadened it results in "losses" 
at the summit but "gains" at the slopes. Integrating over 
windows with sizes in the range of ChlP-seq resolution 
(hundereds of base pairs) allows us to differentiate these 
two cases. The per-locus columns are likewise scaled to 
the 0-1 range before entering the NMF method. The 



output is similar (Figure ID): H contains basis "gain-loss" 
patterns W contains the weights associated with each pat- 
tern at each locus. The difference is that now rows in the 
matrix H correspond to patterns of correlated changes - 
not patterns of absolute levels. 

Algorithm properties 

To illustrate important properties of NMF when applied 
to epigenomic data we ran the "absolute" algorithm on a 
relatively simple publicly available ChlP-seq data set. We 
analyzed 7 histone modifications and one histone variant 
(H2A.Z) mapped by the ENCODE project in the A549 
adenocarcinomic alveolar basal epithelial cell line [43]. 
We focused on regions of TSS -proximal gene bodies since 
they contain epigenetic traces of transcription initiation 
and elongation, and prominently feature all probed marks. 

To illustrate the dependence of c on the factorization 
we ran the "absolute" algorithm with all default parame- 
ters and scanned c values from 3 to 8. First, we quantified 
the average sparsity of matrices H and W using Hoyer s 
formula [34] (Figure 2A). Hoyer 's sparsity takes on val- 
ues between 0 (all vector elements equal) and 1 (single 
non-zero component). We observed that the sparsity of// 
increases linearly up to a knee-point at c = 6, whereas the 
sparsity of W is much lower and has a minimum at c = 6. 
This means that if c is (too) high the H matrix will con- 
tain many rows that have only a single mark with positive 
values. Matrix W contains weights that optimally use all 
codes to reconstruct the observed "chromatin signature" 
at each locus (rows of V). The relatively constant spar- 
sity of W suggests that at most loci multiple basis patterns 
are used and superimposed (Figure 2A). An empirical 
property of NMF is that the higher-rank (large c) solu- 
tions are largely consistent with the lower-rank (small c) 
solutions. For example, in one study involving microarray 
clustering, higher resolution clusters are in general sub- 
sets of lower resolution clusters [30]. To illustrate this for 
basis patterns, we visualized matrices H for c = (3 ... 6) 
(Figure 2B). This showed that codes obtained for higher 
c values are, in general terms, obtained by splitting one 
of the lower resolution codes into two. For example code 
1 at c = 4 is split into code 1 and 5 at c = 5 while 
the latter is further split into code 5 and 6 at c = 6. 
This suggests that for NMF specifying a c which is (too) 
small yields a solution which is consistent with a higher 
(ostensibly correct) rank factorization. This type of stabil- 
ity is particularly useful when analyzing the hierarchical 
dependencies between histone modifications [44]. The 
lower-bound of c is determined by the diversity of histone 
modifications. 

The most important property we would like to high- 
light is that NMF basis patterns are less correlated 
than the input features. Correlation heatmaps are often 
used to reveal patterns of associations between histone 
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modifications e.g. [45,46]. As exemplified in (Figure 3 A) 
these heatmaps show typically little structure beyond 
the general split into marks associated with permissive 
or closed chromatin, which limits their interpretability. 
More importantly, the high correlation between histone 
modifications is problematic for regression and some 
classification methods [47] . High correlation among mul- 
tiple variables in a regression model, referred to as mul- 
ticollinearity, leads to poor interpretability of multiple 
regression slopes [48] . In particular, it is important to test 
for multicoUinearity when attempting to use regression 
coefficients to assess the importance of variables (here lev- 
els of histone modifications and NMF basis patterns). We 
compared correlations of histone modifications to corre- 
lations between basis patterns (Figure 3B) and found that 
codes are remarkably less correlated. According to a rule 
of thumb variables that have correlation coefficients larger 
than 0.8 (Spearman's rank correlation coefficient) should 
not be included together in a single model [49]. Only two 
pairs of NMF codes exceed this threshold and only a single 
code would need to be dropped. This is compared to 11 
pairs of individual epigenetic marks that are exceedingly 
correlated. If all affected marks were dropped, the pruned 
model would contain only three independent variables. In 
Additional file 3: Figure S2 we show an analogous compar- 
ison with the difference that mark levels are calculated at 
promoters and in a different cell type (see Methods). 

Another important feature of the NMF algorithm 
in absolute mode is the similarity of the H matrices 



across cell lines. In (Additional file 4: Figure S3) we 
show the H matrices from human embryonic stem (ES) 
cells (HlESCs), myoblasts (HSMM blasts), and myotubes 
(HSMM tubes) derived from a set of 9 common epige- 
netic marks with c set to 6. In general, rows of each H 
matrix are in no particular order and equivalent codes 
obtained from two or more data sets have to be found 
using (for example) the Munkres assignment algorithm 
[52]. The NNDSVD approach initializes rows of matrix 
H using SVD eigenvectors and indirectly ranks basis pat- 
terns by their variance. This order is likely to be similar 
between different cell lines. We observe that the matrices 
are essentially the same for myotubes and myoblasts and 
only slightly different for HlESCs. This suggest that the 
co-regulation of epigenetic marks is not drastically chang- 
ing during differentiation. To get further insight on the 
complexity epigenetic patterns in terms of combinations 
of the H basis patterns we applied K-means clustering to 
the W matrix (Additional file 5: Figure S4). We clustered 
the W weight matrix corresponding to the c = 6 factor- 
ization from (Figure 2B). The majority of clusters (8) is 
dominated by at most 2 of the 6 codes, which means that 
for the majority of genes a simple weighted sum of two 
codes from (rows from H) is (globally) optimal to recon- 
struct relative levels of epigenetic marks. This should be 
contrasted with the hypothetical case, where most loci 
have highly variable and unique code weight patterns and 
the W matrix displays a second level of combinatorial 
complexity. 
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Case studies 
Regression Poll binding 

In our formulation epigenetic patterns are quantitative Le, 
each locus has a specific non-negative weight for each 
of the basis patterns. This enables us to quantitatively 
link the weights of codes to functional or biochemi- 
cal properties of the underlying loci. To illustrate this 
we tried to predict levels of Pol2 binding at promoters 
of protein-coding genes in human embryonic stem cells 
(HIESC). We compared ridge regression models, which 
either included basis patterns (code-based) or individ- 
ual histone marks (mark-based) as independent variables. 
Levels of histone modifications and Pol2 were calculated 
within 5k kbp window centered at the TSS. 

To obtain the code {H) and weights {W) we applied the 
algorithm in "absolute" mode on 10 histone modifications 
with default parameters and c = 7, The discovered codes 
are shown in (Figure 4A). As expected (see Properties), 
we found that codes are not significantly correlated and 
that all of them should be included in the multiple regres- 
sion (Additional file 3: Figure S2B). On the other hand, six 
pairs of individual epigenetic marks are exceedingly corre- 
lated (Additional file 3: Figure S2A). The primary reason 
why highly correlated variables should not be included in 
a multivariate model is that their beta regression coef- 
ficients become unreliable both in magnitude and sign, 
and thus their biological or physical role is difficult to 
interpret. This suggests that at least three marks out of 
(H3K4me2, H3K4me3, H3K9ac, H3K27ac, H3K79me2) 
should be dropped. Unfortunately, it is not a priori known 



which ones (an alternative method to establish variable 
importance and mitigate effects of multicoUinearity is to 
inspect the output of penalized regression). The mark- 
models performance dropped significantly from = 
0.85 to = 0.70 when we kept two marks that are known 
to be associated with both recruited (H3K4me3), and 
actively transcribing (H3K79me2) Pol2 [20,53]. Although 
the code-based regression includes fewer independent 
variables it has almost the same performance (both = 
0.85) as the mark-based model (Table 1). An inspection of 
the regression slopes (Table 1) and code values (Figure 4A) 
reveals that high weights of code 6 (H3K9ac/H3K27ac) 
and code 1 (H3K4me2/H3K4me3) are most positively 
associated with Pol2 levels, which fully confirms a recent 
study [20]. Due to multicoUinearity, coefficients of the 
mark-based regression are not reliable to rank variable 
importance. For example, the negative beta for H3K4me3 
is inconsistent with numerous reports that link H3K4me3 
to Pol2 binding and transcription [54], which idictates 
overfitting although a penalized regression approach was 
employed. 

To differentiate active transcription from promoter- 
proximal Pol2 pausing we assigned each gene to the basis 
pattern with the highest weight (see Formulation) and 
plotted genes from select codes in the gene expression - 
Pol2 level plane. This projection revealed that genes from 
code 6, featuring most prominently high levels of H3K9ac 
and H3K27ac, have all moderate to high levels of Pol2. 
In contrast, genes associated with code 2, which is dom- 
inated by H3K27me3, have uniformly low levels of Pol2. 



A 

4 



-I 

Z3 



■I 



- __■ I 



I 



% % %3 ' 



H2Az 
H3K27ac 
H3K27me3 
H3K36me3 
H3K4me1 
H3K4me2 
H3K4me3 
H3K79me2 
H3K9ac 
H4K20me1 



i 



B 




E 
o 

c 0.25- 



.Y'^*-.- ••• •.. • 



□codes 
□code 2 



0.25 0.50 0.75 

normalized expression 



Figure 4 Basis patterns at promoters and thieir role in Pol2 recruitment. (A) Basis patterns associated with promoter-regions in ES cells. 
Graphical representation of the H factor matrix of the "absolute" algorithm. Rows correspond to basis patterns (codes), columns to epigenetic marks. 
The height of each bar is equal to the loading of a mark on a code. The codes are sparse i.e. most values are very close to zero. Most marks have 
significant loading (by visual inspection) in one or two codes with the exception of H4K20mel . (B) Codes in the Pol2-expression plane. Each dot 
represents a promoter of a protein-coding gene. The Y-axis represents normalized levels of Pol2 binding; the X-axis is the normalized expression 
level the associated gene. Each genes was assigned a label of the code with the highest weight. Genes assigned to codes 2 and 6 are shown. 
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Table 1 Parameters and performance of mark-based and 



code-based ridge regression models 



Code-based 




Mark-based 




code 


beta 


mark 


beta 


code 1 


13 .48 


H2AZ 


-1 .33 


code 2 


1 . 64 


H3K27ac 


1.28 


code 3 


-24 . 98 


H3K27me3 


-0 .33 


code 4 


12.63 


H3K36me3 


0.73 


code 5 


6 . 55 


H3K4mel 


-0 . 17 


code 6 


32 .24 


H3K4me2 


1 . 17 


code 7 


-2.28 


H3K4me3 


-0 . 80 






H3K79me2 


-0 . 63 






H3K9ac 


3.41 






H4K2 0mel 


0.73 


R2 


0.85 


0.85 




MSE 


0 . 01 




0 . 01 



Ridge regression of Pol2 levels in promoters of protein coding genes, beta - 
regression coefficients; MSE - mean squared error. 



Remarkably, high levels of these activating acetylations are 
not significantly correlated with gene expression, while 
H3K27 tri-methylated genes tend to be expressed at a 
low level. This suggests that high levels of H3K27me3 are 
incompatible with Pol2 binding, and that high levels of 
Pol2 are associated with K3K9ac and H3K27ac at gene 
promoters but not necessarily high gene expression. 

In this example we have shown that quantitative weights 
of the "absolute" basis patterns can be used instead of indi- 
vidual histone modifications levels as independent vari- 
ables in the prediction of Pol2 binding. The code-based 
model had equal performance to the mark-based regres- 
sion, but included a smaller number of independent vari- 
ables and alleviated problems of multicollinearity. Hard 
assignment of genes to codes allowed visualization of 
the regulatory differences in Pol2-recruitment and active 
transcription. 

Classification Poll-bound enhancers vs, promoters 

Polymerase II (Pol2) is known to localize both at promot- 
ers and within intragenic regions. In HI ESC preferential 
association of Pol2 was observed for promoter-distal sites 
enriched for p300, H3K4mel, and H3K27ac [55]. Genes in 
the vicinity of these regulatory regions showed increased 
expression levels, while genes that were activated dur- 
ing differentiation gained Pol2 at close enhancers [56]. 
In differentiated cells Pol2 levels at enhancers have been 
shown to change in response to stimuli and to be asso- 
ciated with H3K4me3 and bidirectional transcription 
[57]. These findings established that enhancers actively 
engaged in transcription are occupied by the polymerase. 



The chromatin patterns of this class of enhancers show 
relatively high levels of H3K4me3 and are more simi- 
lar to patterns at promoters of protein coding genes. We 
decided to test whether Pol2-bound enhancers and Pol2- 
bound promoters can be distinguished based on levels and 
multivariate patterns of epigenetic modifications. 

In the same line of human embryonic cells we divided 
Pol2-enriched regions into two classes. The promoter- 
proximal class was defined as 2 kbp regions centered on 
an Pol2 peak, that overlapped any GENCODE annotated 
TSS site. All remaining 2 kbp sites centered on an Pol2 
peak were classified as promoter-distal. We performed 
the analysis using all promoter and enhancer regions, 
but found the classification was relatively trivial because 
the largest Pol2 peaks are preferentially associated with 
promoter regions [55]. Thus, we challenged the classifi- 
cation algorithm by rerunning the analysis excluding the 
top 20 percent of peaks i.e. those with a very high p- 
value of le-25. First, we compared the overall distribution 
of histone modifications at the promoters and putative 
enhancers. 

We found that some marks showed relatively sim- 
ilar levels (Additional file 6: Figure S5). As expected 
we found substantial H3K4me3 levels at Pol2-bound 
putative enhancers. Strikingly, levels of H3K4mel and 
H3K4me2, which are often associated with poised or 
active enhancers, were markedly higher in TSS -proximal 
sites. On the other hand, H3K27ac, which is associated 
with permissive chromatin, and H4K20mel/H3K79me2, 
which are associated with transcriptional elongation, had 
similar levels at both classes of sites. In agreement with 
recent discoveries, we found that a significant portion of 
intragenic Pol2 sites occurred within "poised" enhancers 
that were enriched for H3K27me3 (Additional file 6: 
Figure S5). Notably, while there were some informative 
level differences, the distributions significantly overlapped 
for the majority of marks. 

To discriminate enhancer from promoter regions we 
first built a series of logistic regression models. The sim- 
plest models ("zero-order" models [23]) included only a 
single independent variable {i.e. the normalized level of 
a single histone modification). These zero-order correla- 
tions directly measure the shared variance between two 
variables, since they reflect the amount of variance in 
the binary outcome variable that is explained by a single 
continuous predictor. In addition a "multivariate" model 
was built that included levels of all marks as predictors. 
An analogous set of zero-order and multivariate models 
was built using NMF codes. This new set of models dif- 
fered from the previous in that they used weights from the 
W matrix rather than levels of individual marks to per- 
form the classification. We applied the "discriminatory" 
algorithm and discovered optimal codes for enhancers 
and promoters independently (Figure IC, Formulation). 
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Intuitively, we attempt to identify codes that are useful 
to encode histone modification levels at enhancers, but 
not promoters (or vice versa). We combined all codes into 
matrix H and re-derived the weight matrix W. Therefore, 
weights for certain codes should discriminate between 
promoters and enhancers. 

We trained all models using 10-fold cross-validation 
and evaluated model performance on 20 percent of the 
observations never used for training (Methods). We found 
that both multivariate models had very good performance 
(Additional file 7: Table SI) as judged by the Matthews 
correlation coefficient (MCC), whereas among zero-order 
models only some code-based regressions showed good 
performance (Area Under the receiver operating char- 
acteristic (ROC) Curve (AUC) in Figure 5, MCC in 
Additional file 8: Table S2). The multivariate code-based 
model outperformed the mark-based model in both per- 
formance measures and achieved an almost perfect score 
AUC 0.97 (Additional file 7: Table SI). The majority of 
the mark-based zero-order models had similar and aver- 
age performance, whereas the AUC scores of zero-order 
code-based models were highly variable (Figure 5). Single 
code models either outperformed all histone modifica- 
tions (with the exception of H3K27me3) or were close to 
the performance of random assignment (Figure 5). Impor- 
tantly, the two best single-code regressions (clO and c8) 
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Figure 5 Classification performance of individual marks and 
codes. Classification performance between TSS-proximal and 
TSS-distal Pol2-bound sites estimated as tine area under the ROC. 
Classification on individual marks or codes was done using L2 
penalized logistic regression and all models were trained using 
10-fold cross validation and the model with the highest performance 
was evaluated on a holdout set of 20 percent observations. Codes 
and their weights were obtained by the "discriminatory" algorithm. 



had significantly better performance than all individual 
histone modifications including H3K27me3. While code 
10 contained H3K27me3 together with H2A.Z, code 8 was 
dominated by marks associated with elongation includ- 
ing H3K36me3 and H4K20mel. This shows that the codes 
had discriminating power beyond that of the best mark 
(H3K27me3 in this case). 

To assess the relative importance of independent vari- 
ables in multivariate regression it is important not to rely 
only on regression coefficients [23]. One approach is to 
compare the ranking and signs of variables from zero- 
order and multivariate models. We found that mark-based 
logistic regressions have incongruent slope estimates. For 
example beta coefficients of three marks change signs 
between the two models. Also the ranking of the beta 
coefficients are not even approximately maintained and 
do not track model AUCs (data not shown). In the mark- 
based case it was difficult to ascertain which histone 
modifications discriminate enhancers from promoters. In 
contrast regression on codes yields models that are eas- 
ier to interpret. Specifically, codes with large zero-order 
coefficients were also relatively important in the multi- 
variate model, which largely maintained the rank-order of 
variables (Additional file 9: Figure S6). Also, codes with 
the largest multivariate coefficients consistently showed 
the best zero-order predictive performance (Figure 5B). 
Several codes have very small zero-order coefficients and 
AUCs, but relatively large multivariate slopes. Likely, these 
codes are not important and could be dropped from the 
multivariate model. 

The code-based approach allowed us to identify which 
patterns of histone modifications discriminate between 
Pol2-bound promoters and enhancers (Figure 6 and 
Additional file 10: Figure S7). Most strikingly, we found 
that promoters and enhancers were separated by codes 
with high levels of H2A.Z. In the context of promoters 
H2A.Z is linked to H3K4me2 and H3K4me3. At enhancers 
H2A.Z frequently co-occurs with H3K27me3 or in a com- 
plex pattern with H3K4mel, H3K9ac, and H3K27ac. This 
explains why the variant on its own is unable to differen- 
tiate sites (Figure 5, Additional file 9: Figure S6A). Recent 
findings on the functional and mechanistic roles of H2A.Z 
allow us to give plausible interpretations of the codes: At 
the TSS H2A.Z levels and close positioning have been 
shown to positively correlate with gene expression and 
P0I2 occupancy [58], also high levels of H3K4me2, and in 
particular H3K4me3, are generally associated with active 
promoters. Hence, code 1 is likely associated with genes 
that are transcriptionally active in the ESC state. H2A.Z 
has been shown to be associated both with poised and 
active enhancers in ESC [59]. It has been proposed to 
act as a general facilitator of genome accessibility due 
to its role in the maintenance of pluripotency and dif- 
ferentiation [39]. The two H2A.Z-loaded enhancer codes 
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Figure 6 Code-based classification of Pol2-bound promoters and enhancers in ES cells. Parameters of a multivariate penalized logistic 
regression model using code variables. Codes and their weights were obtained by the "discriminatory" algorithm. (A) Basis patterns associated with 
the six coefficients with the largest magnitude (B) Coefficients of all basis patterns, where positive and negative values indicate propensity towards 
TSS-proximal and TSS-distal regions, respectively. In total 1 2 codes were trained: 6 on TSS-proximal and 6 on TSS-distal regions. The weights in 1/1/ 
were reconstructed using all 12 basis patterns (Figure IC). 



(10 and 7) seem to reveal this context dependent role 
of H2A.Z. In code 10, H2A.Z is in a repressive con- 
text with H3K27me3 and presumably identifies enhancers 
poised for expression during differentiation while in code 
7 H2A.Z co-occurs with permissive acetylations and the 
base-line enhancer mark H3K4mel [60,61]. Code 7 high- 
lights features of enhancers active in the pluripotent state. 

Surprisingly, both enhancers and promoters, are asso- 
ciated with codes dominated by histone modifications 
associated with transcription elongation. At promoters 
this code is very sparse and contains non-zero values 
for H3K36me3, H4K20mel (both high), and H3K79me2 
(low). At putative enhancers the code is slightly different 
as it does not contain H3K79me2, but includes H3K9ac 
and H3K27ac at low levels. Recently it was shown that 
H3K79me2 is most enriched at 5' ends of genes, slightly 
downstream of H3K4me3, but before the classic elonga- 
tion associated mark H3K36me3 [62]. Thus, it is expected 
to occur at promoter proximal Pol2-bound regions. To the 
contrary, active enhancers are sometimes found in introns 
of transcribed genes. This dependency between active 
transcription and activation of enhancers in the gene body 
appears to be captured via code 7. 

In total, these results suggest that the discovered 
basis patterns capture dependencies between marks that 
discriminate Pol2-bound enhancers or promoters. The 



factorization approach successfully de-correlated epige- 
netic marks, which resulted in an interpretable multi- 
variate classification model. Further, the discovered codes 
are consistent with known epigenetic mechanisms and 
features that regulate Pol2-dependent transcription in 
pluripotent cells. 

Gene set enrichment analysis 

In the previous analysis we have compared "absolute" lev- 
els of histone modifications at multiple classes of loci to 
discover patterns of co-occurring marks that discriminate 
among them. Somewhat analogously, histone modifica- 
tion levels can be compared between two experimental 
conditions. Intuitively, the idea is that patterns of co- 
occurring changes to mark levels could be used to identify 
loci that are subjected to coordinated epigenetic regu- 
lation. Differentiation is a highly regulated process and 
specific reprogramming mechanisms could result in sim- 
ilar epigenetic changes at functionally related genes. In 
other words, genes that share combinatorial patterns of 
changes could have some common molecular functions or 
participate in related pathways. 

To test this hypothesis we applied the "differential" algo- 
rithm (see Formulation) to histone modification data in 
myoblasts and myo tubes. The alignment of myoblasts into 
myotubes represents an important step in myogenesis 
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and is an example of differentiation. Thus, by comparing 
the epigenetic state of myotubes to myoblasts we hope 
to discover patterns of epigenetic changes that are asso- 
ciated with genes that undergo coordinated epigenetic 
reprogramming during myogenesis [63] . We obtain basis 
"gain-loss" patterns from the observed epigenetic changes 
at promoters of protein coding genes (see Formulation 
and Methods). The number of codes represents a signifi- 
cant reduction from the 24 input variables - the 12 gains 
and 12 losses of epigenetic marks (Figure 7). Still, the 
codes are relatively sparse and most variables take signif- 
icant values in only a small number of codes. It should 
be noted that the codes are not gain- or loss-specific and 
gains of certain marks are linked to losses of other marks. 
For example, one pattern highlights H3K4me3 and H2A.Z 
loss linked to an increase in DNase I accessibility and gain 
of H3K4me2 (code 4, Figure 7). 

We tested whether any of the basis patterns are enriched 
for Gene Ontology (GO) terms, biochemical pathways, 
or experimental molecular signatures [64]. Specifically, 
we evaluated the strength of positive association between 



the weights of each code at each locus with functional 
annotations of the underlying loci (see Methods). We 
have chosen the random-set method [65] to quantify 
the enrichment, but other methods including the Gene 
Set Enrichment Analysis (GSEA) [66] and the Fishers 
exact test are equally applicable. Surprisingly, we found 
only one statistically significant (after False Discovery 
Rate (FDR) correction) GO term: code 5 was found to 
be associated with genes involved in the cell cycle (p- 
value: 6.20e-17). Several codes are significantly associated 
with specific pathways (Table 2). Broadly, membrane pro- 
teins and specifically G protein-coupled receptors (GPCR) 
have increased weights of the repressive code 2; code 
3 is linked to genes involved in transcription; splicing 
and translation; finally, genes involved in the cell cycle 
are, again, associated with higher levels of code 5. These 
pathways are critical to myogenesis. During differentia- 
tion myoblasts exit the cell cycle and increase protein 
synthesis to expand the myofibrillar muscle cell com- 
partment [67]. Fully mature muscle cells express tens of 
different GPCR receptors [68]. Many of which have been 
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Figure 7 "gain-loss" codes of epigenetic reprogramming during differentiation. Graphical representation of tine H factor matrix of tine 
"differential" algorithm. Rows correspond to "gain-loss" basis patterns, columns to both gains and losses of epigenetic marks. The height of each bar 
is equal to the loading of a mark gain or loss on a code. Columns are grouped based on the direction of epigenetic change ("gain" or "loss"). Within a 
single code it is rare to observe both a significant "gain" and "loss" of the same mark. Most mark changes have significant loading in two codes with 
the exception of H3K9me3 gain and H3K27me3 loss. The majority of patterns combines "gains" and "losses"of multiple marks. 
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Table 2 Statistical association of epigenetic remodelling patterns and molecular pathways during myoblast 
differentiation 
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For each of the "gain-loss" basis patterns only the most significant (FDR-corrected) terms are shown for each code. Pathway gene annotations and identifiers are from 
MSigDB, but are originally sourced from KEGG and Reactome. Codes that did not have any significant terms at the 1 E-6 level were omitted. 



shown to enable muscle function by regulating growth, 
contractility, and glucose uptake. However, the genes asso- 
ciated with code 2 (neuroactive (p: 1.57e-43) and olfactory 
(p: 2.09e-17) GPCRs) are transcriptionally silenced as cells 
transition from myoblasts to myotubes. Among molecu- 
lar signatures we found over 350 highly enriched terms 
(p: le-10) (Additional file 11). The most significant associ- 
ation was between targets of E2F4 and the "gain-loss" code 
5 (p: 8.54e-156). E2F4 is a transcriptional regulator with 
a specific role in the repression of cell-cycle genes and 
ability to recruit HDACl -containing co-repressor com- 
plexes [69] . It is notable that the most prominent feature 
of code 5 is the loss of H3K9ac and H3K27ac. Although 
HDACs have relatively low substrate specificity, which is 
dependent on their co-factors, HDACl has been recently 
implicated in the specific deacetylation of H3K9 [70]. 
A prediction based on this analysis is that H2A.Z lev- 
els increase at cell cycle genes repressed during myoblast 
differentiation. Taken together these results suggest that 
during myogenesis distinct patterns of net gains or losses 
of epigenetic marks are associated with functional classes 
of genes. 



Myoblasts are embryonic progenitor cells with myo- 
genic potential. They are more differentiated than ES 
cells, but markedly less than myotubes. Expression of 
pluripotency factors can either abrogate differentiation 
into myotubes or elicit reprogramming of myoblasts into 
induced pluripotent stem cells (iPSC) [71,72]. We hypoth- 
esized that targets of pluripotency factors which are 
silenced during differentiation will share an epigenetic 
signature of their repression. We found that experimen- 
tal targets of several pluripotency factors, including MYC 
(c-Myc), NANOG, POU5F1 (Oct4), and SOX2, are all 
strongly associated with the repressive "gain-loss" code 
5 (Additional file 11). Further significant enrichments 
were observed for other "pluripotent" gene categories 
including the protein-protein interaction network shared 
among pluripotent cells (PluriNet) [73], and the core 
ESC-like module [74], which includes genes coordinately 
up-regulated in a compendium of ESCs. We decided 
to test whether these enrichments are due to the spe- 
cific combinatorial pattern of gains and losses captured 
by code 5, or alternatively could be explained by any 
of the individual marks (Figure 8). We found that each 
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Myc/Max Nanog Oct4 Sox2 PluriNet Stem 

Figure 8 Epigenetic reprogramming of the pluripotency 
network. Association of liistone modification clianges and NMF 
codes witli genes involved in pluripotency. Myc/Max, Nanog, Oct4, 
Sox2 are shorthand for experimental targets of these TFs in human 
embryonic stem cells. PluriNet are genes involved in the 
protein-protein interactions networks discovered in pluripotent cells. 
Stem are genes up-regulated in embryonic stem cells shared 
between mouse and human. The five marks or codes most 
significantly (p-value) associated with each group of genes are shown. 



category of pluripotent genes is more strongly associ- 
ated with code 5 than with any of the single epigenetic 
marks. Further, the association of other epigenetic marks 
within the six categories was inconsistent and no sin- 
gle mark could be chosen as a strong proxy for code 5. 
This suggests that promoters of genes that maintain the 
pluripotent state are epigenetically silenced in a coor- 
dinated way and that this is captured by one of the 
"gain-loss" epigenetic "codes". Other patterns of silenc- 
ing i.e. codes 4 and 6 which display loss of H3K4me3/ 
H2A.Z and H3K4me2/H3K4mel/H2A.Z, respectively 
are not consistently associated with the pluripotency 
categories. 

Discussion 

We have introduced a new computational technique 
to discover combinatorial patterns of histone modifica- 
tions. At its core this method relies on non-negative 
matrix factorization (NMF) to separate the complex 
"chromatin signatures" at genomic loci into small basis 
patterns we refer to as codes. These simple parametriza- 
tions of the data reveal frequently co-occurring marks, 
which could potentially be read by multivalent chromatin 



complexes, or represent differential signatures of coordi- 
nated epigenetic reprogramming. Most importantly, the 
application of NMF results in dimensionality reduction 
and de-correlation, but maintains the quantitative aspect 
of epigenetic mark levels. 

NMF is one among many matrix factorization algo- 
rithms. Results from alternative methods are different due 
to the difference in the imposed factorization constraints 
and objectives. Principal component analysis (PC A) con- 
strains // to be a set of orthonormal vectors; vector 
quantization (VQ), which is equivalent to K-means, con- 
strains W to contain vectors with one non-zero value; 
while NMF imposes that W and H are non-negative. 
These constraints result in fundamentally different out- 
puts. PCA favors global reconstruction, which means 
that every element in V is reconstructed through com- 
plex cancellations of positive and negative values in W 
and H. PCA allows basis vectors (principal component, 
PC) in H to be ranked by importance. The reconstruc- 
tion error increases when the least important PC is 
omitted, but the "coarse" global features of input data 
are preserved. On the other hand NMF basis vectors 
cannot be dropped, since it would result in the loss 
of important parts of (a subset of) the reconstructed 
vectors. 

While PCA dimensions do not resemble any particu- 
lar data point or combination of data points, NMF basis 
vectors can be readily interpreted as patterns of fre- 
quently co-occurring histone modifications. Only a small 
number of these codes is sufficient to reliably recon- 
struct the observed "chromatin signatures" at thousands 
of loci and, as shown by our analyses, to preserve, or 
even boost, biological information. Although with respect 
to the mean squared error (MSE) PCA is theoretically 
optimal for reconstruction, NMF can perform better for 
classification [75] or recognition [76]. In some sense NMF 
returns results that are in between PCA and VQ. In 
VQ each data point is locally approximated by a sin- 
gle cluster centroid, PCA uses all available components, 
while in NMF typically few, but not all, basis vectors are 
required to represent a single data point. If the goal is 
to assign loci to epigenomic states a form of clustering is 
preferred as cluster centroids are often intuitively under- 
stood. PCA will perform best if the number of histone 
modifications is large but one desires only few basis vec- 
tors (principal components). As illustrated in this paper 
NMF basis vectors perform well in supervised machine 
learning. An alternative and analogous approach, known 
as principal components regression, is to use weights of 
principal components instead of weights of NMF basis 
vectors. An advantage of NMF is the physical inter- 
pretability of the its basis vectors. Conversely, the optimal 
reconstruction error of PCA might be important for very 
simple models. 
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Conclusions 

We have shown that NMF applied to epigenetic marks 
yield sparse codes with an important nesting property. 
Further we have demonstrated the benefits of using codes 
over individual marks in predictive modeling of Pol2- 
binding. In particular, dimensions obtained from NMF are 
less correlated than the individual marks, and problems 
resulting from multicoUinearity are alleviated. In addition 
we developed two variants of the basic algorithm which 
extended its applicability to multiple classes of genomic 
regions and paired experimental samples. We have shown 
the excellent performance of codes for the classification 
of Pol2-bound enhancers and promoters. The most dis- 
criminatory codes highlighted the context-dependence of 
H2A.Z, which is consistent with current knowledge on the 
role of this histone variant in the regulation of transcrip- 
tion in ES cells [59] . To showcase the algorithm for paired 
experimental samples, we analyzed chromatin remodel- 
ing during myogenesis. We established that genes from 
pathways involved in protein synthesis (anabolism), the 
cell cycle, and signaling from G protein-coupled receptors 
show unique patterns of chromatin activation or silencing. 
Finally, we were able to show that target genes of pluripo- 
tency factors are also associated with the same chromatin 
remodeling pattern [77]. 

In summary, we have introduced a general NMF-based 
approach to represent combinatorial patterns of epige- 
netic marks as quantitative variables. We have shown 
the utility of this representation for predictive model- 
ing, supervised machine learning and gene set analysis. 
Hence, this technique is complementary to more descrip- 
tive methods aimed at "chromatin pattern" discovery such 
as genome-wide segmentation and clustering. 

Methods 

Implementation 

All three variants of the presented NMF-based algorithm 
are provided as the epicode open-source software pack- 
age. The software provides all that is required to dis- 
cover basis patterns from aligned sequencing data and 
sets of user-provided reference regions. Epicode pro- 
vides three modes of operation: "absolute" and "discrim- 
inatory" and "differential". In the "absolute" mode the 
user is expected to provide a set of genomic loci (in a 
UCSC Browser Extensible Data (BED) file) of interest and 
aligned sequencing data for a single experimental con- 
dition (in Binary sequence Alignment/Map (BAM) files). 
The regions can be global such as promoters of protein 
coding genes or specific subsets e.g. "putative enhancers of 
expressed miRNAs". The input sequencing data are typ- 
ically histone modifications mapped in a single cell line 
and experimental condition. In the "discriminatory" mode 
the user provides two sets of loci e.g. enhancers and pro- 
moters. The "differential" mode requires a single set of 



genomic regions, but two sets of sequencing data, which 
correspond to the same marks mapped in two conditions 
or cell lines. 

We have implemented epicode as a Python 2.7 software 
package, and also provide a command-line executable. 
The code should run on UNIX-like operating systems and 
has been tested on Linux (Arch, RHEL 6). Dependencies: 
several Python packages are required by epicode, includ- 
ing NumPy [78], SciPy [79], scikit-learn [80], and pysam. 
Input formats: the tool is designed to work with stan- 
dard file formats. Reference genomic sites are expected in 
the BED6+ file format [81]. Sequencing data is read from 
coordinate sorted BAM files [82]. Output: Results are 
reported in a machine-readable tab-delimited file format. 
Scripts in the R language are provided to generate pub- 
lication quality figures from one of the output files. The 
current implementation of Epicode is lO-bound mean- 
ing that the majority of time is spent in reading the BAM 
files. The factorization takes typically less than 5 minutes 
on a single Intel(R) Xeon(R) CPU E5-1620 0 3.60 GHz 
core. Reading the BAM files takes up-to 30 minutes using 
four cores and strongly depends on the hard-drive speed. 
6. URL and license for software should be mentioned in 
manuscript. The software is freely available (MIT license) 
at https://github.com/mcieslik-mctp/epicode. 

Throughout the manuscript epicode has been used with 
all default parameters (as of version 1.0), with the excep- 
tion of the "differential" algorithm for which a step of 50 
was chosen. 

Data normalization 

To construct matrix V from sorted BAM files, individual 
reads are counted within regions (lines) of the user pro- 
vided BED files. Each read that overlaps the target region 
is counted towards that region. All columns (marks) of 
V are scaled from 0 to 1 using a sigmoid function such 
that the mapping is approximately linear up to the 95th 
percentile [36]. 

X = 2/(1 + - 1 

Here, u is the 95th percentile of the values in vector 
X and X is the scaled vector. The scaling is done before 
factorization. In "differential" mode enrichment signals 
(Figure ID) are windowed and corrected for sequencing- 
depth (i.e., normalized) using the provided Python imple- 
mentation of the DESeq algorithm [42]. After subtraction 
the window scores are summed to overall "gain" (posi- 
tive integral) and "loss" scores (negative integral) for each 
locus. "Gain-loss" scores are likewise sigmoid scaled. 

Enrichment analysis 

Associations of functional gene sets with mark levels 
and basis pattern weights were done using the random- 
set method [65]. (an implementation of the random-set 



Cieslikand Bek\rano\/ BMC Genomics 20^ 4, 15:76 
http://www.bionnedcentral.conn/l 471 -21 64/1 5/76 



Page 15 of 18 



method is included in the source-code distribution of epi- 
code). Annotations for Ensembl genes were obtained from 
MSigDB [64] (msigdb.vS.l.entrez.gmt) and re-mapped 
from Entrez gene ids (EG) onto GENCODE V14 genes 
[83] using identifier maps (EG to ENSG) from Ensembl 
(current as of Apr 20th 2013). Association p-values 
(obtained from the random-set method) were FDR- 
corrected (BH-method [84]) over the whole 8513 terms in 
the MSigDB database (which is more stringent), but we 
reported on associations from different classes of MSigDB 
gene sets individually, since different types of gene sets 
have different distributions of association p-values (exper- 
imental gene sets are typically more closely correlated 
than literature-derived). 

Data sources 

All raw sequencing data used in the case studies were 
downloaded from the ENCODE project website as 
FASTQ files. We included all available histone modifica- 
tion data sets for four cell lines A549, HIESC, HSMM, 
and HSMMT, with the exception of H3K36me3 in A549 
because of poor reproducibility of this dataset between 
replicates. Additional Pol2 (ChlP-seq), expression (RNA- 
seq), and DNase accessibility (Digital Genomic Footprint- 
ing (DGF) and DNase-seq) data sets were downloaded for 
HIESC. In the case of histone ChlP-seq and DNase acces- 
sibility, reads from multiple replicates (BAM files) were 
combined into a single BAM file using samtools merge. 
List of all analyzed files in included in in Additional file 12. 

Data processing 

We used Bowtie2 [85] with all default settings and indexes 
for the HG19 genome build (ftp://ftp.ccb.jhu.edu/pub/ 
data/bowtie2_indexes/hgl9.zip) for all alignments. To 
count exonic RNA-seq reads we used the HTSeq tool [42] 
with default settings on the GENCODE-provided General 
Transfer Format (GTF) file. To estimate expresssion levels, 
read-counts for each gene were normalized by total exon 
length, averaged over replicate samples, and finally scaled 
to the 0 to 1 range using the same sigmoid function. 

Supervised machine learning 

Predictive modeling (ridge regression, penalized logistic 
regression) was done using scikit-learn. All model param- 
eters, including penalty type (11' or 12') and regularization 
strength C (1, 2, 5, 10, 50, 100, 500), were trained using 
10-fold cross-validation. All cross-validated models used 
12' penalty and C = 1. Models were evaluated on 20 per- 
cent, using scripts included in scikit-learn, on hold-out 
data which was never used for training or cross-validation. 

Evaluation of initialization methods 

Three initialization methods were evaluated NNDSVD, 
random, and randomized NNDSVD (NNDSVDar). In the 



random initialization both W and H matrices are filled 
with random uniform numbers (0 to 1 range). In the 
NNDSVDar only zero elements (after NNDSVD) are set 
to small (close to 0) random numbers. The NNDSVD 
approach is deterministic and is described in detail in 
[33]. To evaluate similarity between two H matrices we 
use the Munkres algorithm to establish the minimum cost 
assignment. To find this minimum it is necessary to pair 
the most similar rows. Similarity of a pair (cost) is eval- 
uated based on the Euclidean distance. The minimum- 
cost assignment of basis vector pairs is found using the 
Munkres algorithm [52] i.e. a set of pairs is found that 
minimizes the global cost. We calculate sparsity using 
Hoyer's formula. 

Additional files 



Additional file 1 : Python source-code archive. 

Additional file 2: Figure S1 Reconstruction error of NMF runs based 
on random initializations. Reconstruction error of 1000 NMF runs plotted 
as a function of the similarity of the factorization to the reference matrix H 
obtained using NNDSVD. The factor matrices are initialized using random 
positive numbers. Similarity between two H matrices is obtained by 
calculating the minimum euclidean distance between their basis vectors 
(see Methods). The approximately linear trend shows that solutions that 
are most similar to NNDSVD have the smallest reconstruction error. Very 
few solution with a small reconstruction error are dissimilar to the NNDSVD 
output. 

Additional file 3: Figure S2 Correlation of marks and codes within 
promoter regions Heat maps of Spearman's rank correlation coefficients 
between marks and NMF basis patterns (codes) in bodies of protein coding 
genes (marks were mapped in A549 cells). The rows and columns of both 
heatmaps were sorted using Ward's clustering and the cosine metric. None 
of the code-correlations exceeds 0.6, which suggests that all variables 
should be included in the regression. 

Additional file 4: Figure S3 Universality of the H matrix. Graphical 
representation of H matrices derived for cell types at various levels of 
differentiation. Levels of epigenetic marks were quantified at promoters (1 
kbp windows which include 900 bp upstream and 100 bp downstream of 
the TSS) of protein coding GENCODE genes. The "absolute" algorithm was 
applied fore = 6 with all standard settings. HI ESC - HI human embryonic 
stem cells, HSMM blast - myoblast cells, HSMM tube - myotube cells. 

Additional file 5: Figure S4 Clustering of the W matrix. K-means 
clustering heatmap of the W matrix. The K-means algorithm was applied 
for /( = 1 2 to the code weight matrix corresponding to (Figure 2B, c = 6) 
using Euclidean distance and median centroids. Clusters were ordered 
according to a hierarchical clustering of their medians. 

Additional file 6: Figure S5 Epigenetic mark levels at TSS-proximal 
and TSS-distal Pol2-bound sites. Box plots of sigmoid-scaled levels of 
histone modifications at 2 kbp sites centered around a Pol2-peak summit, 
(up) TSS-proximal sites overlapping a TSS site known to GENCODE. 
(bottom) TSS-distal sites. Boxes indicate medians and 25th and 75th 
percentiles. Whiskers extend to 1 .5 time the interquartile range (IQR) or 
roughly the 95th percentile. 

Additional file 7: Table S1 Classification performance of mark-based 
and code-based logistic regression in the classification of Pol2-bound 
sites. 

Additional file 8: Table S2 Parameters of penalized logistic 
regression models: supervised classification of Pol2-bound 
TSS-proximal and TSS-distal sites. 

Additional file 9: Figure S6 Coefficients of penalized logistic 
regressions for the classification of TSS-proximal and TSS-distal 
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Pol2-bound sites. (A) Bar charts of regression coefficier^ts for multivariate 
(top) and zero-order (bottom) mark based logistic regressions models. (B) 
Bar charts of regression coefficients for multivariate (top) and zero-order 
(bottom) code-based logistic regressions models. 

Additional file 1 0: Figure S7 Code-dependent distribution of 
TSS-proximal and TSS-distal Pol2-bound sites. Each input site was 
assigned to the discriminatory epigenetic code for which it had the highest 
loading. For each code the number of TSS-proximal and TSS-distal 
Pol2-bound sites is plotted. 

Additional file 1 1 : Statistical association of epigenetic remodelling 
patterns and molecular signatures during myoblast differentiation. 

Additional file 1 2: List of analyzed datasets. 
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